Rethinking the Inception Architecture for Computer Vision
11. Conclusion
The combination of lower parameter count and additional regularization with batch-normalized auxiliary classifiers and label-smoothing allows for training high quality networks on relatively modest sized training sets.
The final sentence mentions Label Smoothing.
7. Model Regularization via Label Smoothing
k is the label index, k ∈ {1, ..., K}
p(k|x): the probability the model assigns to label k for training example x
The ground truth is q(k) = δ_k,y
i.e. 1 when k = y and 0 otherwise
The paper calls this a Dirac delta (for discrete labels this is the same as the Kronecker delta)
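As a quick illustration (my own sketch, not code from the paper): the paper defines p(k|x) as the softmax of the logits z_i, p(k|x) = exp(z_k) / Σ_i exp(z_i). The logits below are made up.

    import numpy as np

    def p_given_x(z):
        """p(k|x): softmax over the logits z, per the paper's definition."""
        e = np.exp(z - np.max(z))  # shift by max for numerical stability
        return e / e.sum()

    def q_ground_truth(y, num_classes):
        """q(k) = δ_k,y: one-hot distribution at the true label y."""
        q = np.zeros(num_classes)
        q[y] = 1.0
        return q

    # Hypothetical logits for a 5-class problem with true label y = 2
    z = np.array([1.0, 0.5, 3.0, 0.2, -1.0])
    print(p_given_x(z))
    print(q_ground_truth(2, 5))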
Consider a distribution over labels u(k), independent of the training example x, and a smoothing parameter ε.
The label distribution is replaced:
from q(k|x) = δ_k,y to q′(k|x) = (1 − ε)δ_k,y + εu(k)
i.e. every label k gains a mass of εu(k), while the weight on the true label drops to 1 − ε
u(k) is a fixed distribution, independent of x
In our experiments, we used the uniform distribution u(k) = 1/K
We refer to this change in ground-truth label distribution as label-smoothing regularization, or LSR.
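Putting the pieces together, a minimal sketch of LSR with the uniform u(k) = 1/K (my own illustration; the default eps = 0.1 matches the value the paper reports using in its ImageNet experiments):

    import numpy as np

    def smooth_labels(y, num_classes, eps=0.1):
        """q′(k|x) = (1 − eps)·δ_k,y + eps·u(k), with uniform u(k) = 1/K."""
        q = np.zeros(num_classes)
        q[y] = 1.0                                  # ground truth δ_k,y
        return (1.0 - eps) * q + eps / num_classes  # mix in uniform u(k)

    # Example: K = 5 classes, true label y = 2
    print(smooth_labels(2, 5))  # → [0.02 0.02 0.92 0.02 0.02]

Every class receives ε/K = 0.02 of probability mass, and the true class keeps (1 − ε) + ε/K = 0.92, so the target is no longer a hard one-hot vector.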